The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.). (Source: https://www.kaggle.com/competitions/titanic)
This notebook focuses on stacking to predict whether the "test" passengers survived. We include a total of 5 models in the stack, each with hyperparameter tuning, and provide feature importances for each. Before building the predictive model, we do a brief EDA followed by feature selection and feature engineering.
#libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import plotly.express as px
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     cross_validate, GridSearchCV)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier
train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')
gender = pd.read_csv('titanic/gender_submission.csv')
train.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
We have both numerical and categorical features, some of which have missing values (Age, Cabin, and Embarked in train; Age, Fare, and Cabin in test).
Some interesting questions for EDA:
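As a quick check, the per-column missing-value counts can be listed with `isna().sum()` (a minimal sketch; the tiny frame below stands in for the `train` DataFrame loaded above):

```python
import numpy as np
import pandas as pd

# tiny stand-in for the train DataFrame (illustration only)
df = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0],
    'Cabin':    [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S'],
})

missing = df.isna().sum()  # NaN count per column
print(missing)
```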
#How different is the survival ratio for males/females?
total = len(train)
no_male = train["Sex"].value_counts()["male"]
no_females = train["Sex"].value_counts()["female"]
survived_male = train[train["Sex"] == "male"]["Survived"].value_counts()[1]
survived_females = train[train["Sex"] == "female"]["Survived"].value_counts()[1]
print(f"Out of {no_male} males, {survived_male} survived. Their survivability rate is {survived_male/no_male*100}%.")
print(f"Out of {no_females} females, {survived_females} survived. Their survivability rate is {survived_females/no_females*100}%.")
#PLOTTING
fig, ax = plt.subplots(figsize= (10,10*0.618))
height = [survived_male/no_male*100, survived_females/no_females*100]
bars = ('Male', 'Female')
x_pos = np.arange(len(bars))
# Create bars and choose color
plt.bar(x_pos, height, color = ['blue','cyan'])
# Add title and axis names
plt.title('Survival Rate is considerably higher for women than men')
plt.xlabel('Gender')
plt.ylabel('Survival Rate %')
plt.xticks(x_pos, bars)
ax.set_ylim([0, 100]) #gives better perspective of percentages
# Show graph
plt.show()
Out of 577 males, 109 survived. Their survivability rate is 18.890814558058924%.
Out of 314 females, 233 survived. Their survivability rate is 74.20382165605095%.
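The same rates can be computed in one line with a `groupby`, since the mean of the 0/1 `Survived` column is exactly the survival rate (a sketch; the small synthetic frame stands in for `train`):

```python
import pandas as pd

# synthetic stand-in: 1 of 3 males and 2 of 3 females survived
df = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# mean of the 0/1 Survived column per gender = survival rate in %
rates = df.groupby('Sex')['Survived'].mean() * 100
print(rates)
```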
# Passenger distribution and survival based on age?
#Not using imputed values
#Data
survived = train[train['Survived'] == 1]['Age']
died = train[train['Survived'] == 0]['Age']
#plot
fig, ax = plt.subplots(figsize = (10,10*0.618))
ax.hist([survived, died], bins=8, stacked=True, density=False, edgecolor='black')
ax.legend(("survived","died"))
plt.title('Survivors by Age Group')
plt.xlabel('Age')
plt.ylabel('Count')
Overall the survival rate seems similar across age groups, with the exception of young children aged 0 to 10, a smaller group that shows a noticeably higher survival rate.
#Passenger survival based on fare?
#Data
survived = train[train['Survived'] == 1]['Fare']
died = train[train['Survived'] == 0]['Fare']
#plot
fig, ax = plt.subplots(2, figsize = (10,10))
ax[0].hist([survived, died], bins=20, stacked=True, density=False, edgecolor='black')
ax[0].legend(("survived","died"))
ax[0].set_title('Survivors by Fare Group (skewness makes it hard to see survivability in high fare groups)')
ax[0].set_xlabel('Fare')
ax[0].set_ylabel('Count')
ax[1].hist([survived, died], log = True, bins=20, density=False, edgecolor='black')
ax[1].legend(("survived","died"))
ax[1].set_title('Survivors by Fare Group using log scale and not stacked')
ax[1].set_xlabel('Fare')
ax[1].set_ylabel('Count')
Text(0, 0.5, 'Count')
The passengers in the higher fare groups seem to have a higher survival rate than those in the lowest fare groups.
#Passenger survival based on Passenger class?
train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived')
| | Pclass | Survived |
|---|---|---|
| 2 | 3 | 0.242363 |
| 1 | 2 | 0.472826 |
| 0 | 1 | 0.629630 |
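The same per-class rates can also be read off a normalized crosstab, where `normalize='index'` makes each row sum to 1 (a sketch on synthetic data standing in for `train`):

```python
import pandas as pd

# synthetic stand-in: class 1 fares best, class 3 worst
df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# rows: Pclass, columns: Survived (0/1), values: within-class fractions
ct = pd.crosstab(df['Pclass'], df['Survived'], normalize='index')
print(ct)
```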
#Do lone passengers have the same survivability as those accompanied?
train.loc[(train['SibSp'] > 0)|(train['Parch'] > 0), 'Company'] = 1
train.loc[(train['SibSp'] == 0)&(train['Parch'] == 0), 'Company'] = 0
no_alone = train["Company"].value_counts()[0]
no_accompanied = train["Company"].value_counts()[1]
survived_alone = train.loc[train["Company"] == 0]["Survived"].value_counts()[1]
survived_accompanied = train.loc[train["Company"] == 1]["Survived"].value_counts()[1]
print(f"Out of {no_alone} alone, {survived_alone} survived. Their survivability rate is {survived_alone/no_alone*100:.2f}%.")
print(f"Out of {no_accompanied} accompanied, {survived_accompanied} survived. Their survivability rate is {survived_accompanied/no_accompanied*100:.2f}%.")
Out of 537 alone, 163 survived. Their survivability rate is 30.35%.
Out of 354 accompanied, 179 survived. Their survivability rate is 50.56%.
#Do titles correlate with survivability?
#creating df with titles from names
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
train['Title'].value_counts()
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Capt          1
Lady          1
Don           1
Mme           1
Ms            1
Countess      1
Sir           1
Jonkheer      1
Name: Title, dtype: int64
pd.crosstab(train['Title'], train['Survived'])
| Survived | 0 | 1 |
|---|---|---|
| Title | | |
| Capt | 1 | 0 |
| Col | 1 | 1 |
| Countess | 0 | 1 |
| Don | 1 | 0 |
| Dr | 4 | 3 |
| Jonkheer | 1 | 0 |
| Lady | 0 | 1 |
| Major | 1 | 1 |
| Master | 17 | 23 |
| Miss | 55 | 127 |
| Mlle | 0 | 2 |
| Mme | 0 | 1 |
| Mr | 436 | 81 |
| Mrs | 26 | 99 |
| Ms | 0 | 1 |
| Rev | 6 | 0 |
| Sir | 0 | 1 |
At first sight, most titles relate to gender, age, or profession. Some appear with very low frequency, such as Don, Capt, or Lady, while others are very common: Master, Mrs, Miss, and Mr. Survivability also varies widely among the title groups, with titles pointing to females and to higher socioeconomic status having higher survival rates.
#Important note, to include gender in the correlation matrix, we change the 'Sex' column to 1 for males and 0 for females
#matrix only includes numerical features
train['Sex'].replace(['male','female'], [1,0], inplace = True)
test['Sex'].replace(['male','female'], [1,0], inplace = True)
#Correlation Matrix
fig = px.imshow(train[['Survived','Pclass','Age','SibSp', 'Parch', 'Sex', 'Fare', 'Company']].corr(), title='Correlation Matrix')
fig.show()
Gender correlates the most with survivability, followed by Fare and Passenger Class
We basically have a classification problem to handle: we want to predict whether a passenger survived based on their features.
We'll start with a simple logistic regression, but first we need a bit of feature engineering to prepare the data.
# Merge both datasets so we only do the preprocessing once
total = pd.concat([test.assign(ind="test"), train.assign(ind="train")], ignore_index=True)
#the 'ind' indicator column added while concatenating lets us separate them again later
#Imputing missing age values with age average of similar passenger groups
total['Age'] = total['Age'].groupby(by=[total['Pclass'], total['Sex']])\
.apply(lambda x: x.fillna(x.mean())) #lambda function will allow to apply mean age of each subset
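An equivalent, arguably more idiomatic way to fill the group means is `groupby(...).transform('mean')`, which keeps the original index and avoids the lambda (a sketch on a tiny synthetic frame in place of `total`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Sex':    [0, 0, 1, 1],
    'Age':    [30.0, np.nan, 20.0, np.nan],
})

# per-row mean Age of that row's (Pclass, Sex) group, aligned on the index
group_mean = df.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
df['Age'] = df['Age'].fillna(group_mean)
print(df['Age'].tolist())  # [30.0, 30.0, 20.0, 20.0]
```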
#Imputing missing Embarked with mode
total['Embarked'] = total['Embarked'].fillna(total['Embarked'].mode()[0])
#Imputing missing Fare with the median (Fare has outliers, so the median is more robust than the mean)
total['Fare'] = total['Fare'].fillna(total['Fare'].median())
#Dropping columns that have no use (PassengerId, Ticket) or very little data (Cabin)
total.drop(columns=['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace = True)
#Adding a column for whether a passenger is alone or not
total.loc[(total['SibSp'] > 0)|(total['Parch'] > 0), 'Company'] = 1
total.loc[(total['SibSp'] == 0)&(total['Parch'] == 0), 'Company'] = 0
From the correlation matrix we also know 'SibSp' and 'Parch' have little correlation with survivability, whereas the variable we derived from those two ('Company') shows a fairly high correlation. To avoid a multicollinearity issue, we drop both 'SibSp' and 'Parch'.
total.drop(['SibSp', 'Parch'], axis = 1, inplace = True)
total.head(5)
| | Pclass | Name | Sex | Age | Fare | Embarked | ind | Survived | Company | Title |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Kelly, Mr. James | 1 | 34.5 | 7.8292 | Q | test | NaN | 0.0 | NaN |
| 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | 0 | 47.0 | 7.0000 | S | test | NaN | 1.0 | NaN |
| 2 | 2 | Myles, Mr. Thomas Francis | 1 | 62.0 | 9.6875 | Q | test | NaN | 0.0 | NaN |
| 3 | 3 | Wirz, Mr. Albert | 1 | 27.0 | 8.6625 | S | test | NaN | 0.0 | NaN |
| 4 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 0 | 22.0 | 12.2875 | S | test | NaN | 1.0 | NaN |
categorical_columns = ['Pclass', 'Embarked']
for column in categorical_columns:
    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoded_total = pd.DataFrame(encoder.fit_transform(total[[column]]))
    # get OHE column names
    encoded_total.columns = encoder.get_feature_names_out([column])
    # One-hot encoding removed the index; put it back
    encoded_total.index = total.index
    # Remove the original categorical column (replaced by its one-hot encoding)
    rem_total = total.drop(column, axis=1)
    # Add the one-hot encoded columns to the remaining features
    total = pd.concat([rem_total, encoded_total], axis=1)
total.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Name        1309 non-null   object
 1   Sex         1309 non-null   int64
 2   Age         1309 non-null   float64
 3   Fare        1309 non-null   float64
 4   ind         1309 non-null   object
 5   Survived    891 non-null    float64
 6   Company     1309 non-null   float64
 7   Title       891 non-null    object
 8   Pclass_1    1309 non-null   float64
 9   Pclass_2    1309 non-null   float64
 10  Pclass_3    1309 non-null   float64
 11  Embarked_C  1309 non-null   float64
 12  Embarked_Q  1309 non-null   float64
 13  Embarked_S  1309 non-null   float64
dtypes: float64(10), int64(1), object(3)
memory usage: 143.3+ KB
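For comparison, `pd.get_dummies` produces the same `column_value` one-hot columns without the encoder loop (a sketch; note it has no `handle_unknown` option, so the scikit-learn encoder remains the safer choice when unseen categories are possible):

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 2, 3], 'Embarked': ['S', 'C', 'Q']})

# one indicator column per category, named like Pclass_1, Embarked_C, ...
encoded = pd.get_dummies(df, columns=['Pclass', 'Embarked'])
print(list(encoded.columns))
```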
#We will drop the Name column and replace it with the Title feature
#As seen from our EDA we have 4 very common titles and the remaining are rare ones
#we form 5 categorical title groups and map them to integer codes to "feed the machine"
total['Title'] = total['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
total['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir',
                        'Jonkheer', 'Dona', 'Mlle', 'Ms', 'Mme'], 'Rare', inplace=True)
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
total['Title'] = total['Title'].map(title_mapping)
#fill unmatched titles with the most common title per gender;
#assigning through .loc avoids the SettingWithCopyWarning raised by chained indexing
total.loc[(total['Sex'] == 1) & (total['Title'].isna()), 'Title'] = 1  #Mr is the most common male title
total.loc[(total['Sex'] == 0) & (total['Title'].isna()), 'Title'] = 2  #Miss is the most common female title
#drop name column
total.drop('Name', axis=1, inplace = True)
total.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Sex         1309 non-null   int64
 1   Age         1309 non-null   float64
 2   Fare        1309 non-null   float64
 3   ind         1309 non-null   object
 4   Survived    891 non-null    float64
 5   Company     1309 non-null   float64
 6   Title       1309 non-null   int64
 7   Pclass_1    1309 non-null   float64
 8   Pclass_2    1309 non-null   float64
 9   Pclass_3    1309 non-null   float64
 10  Embarked_C  1309 non-null   float64
 11  Embarked_Q  1309 non-null   float64
 12  Embarked_S  1309 non-null   float64
dtypes: float64(10), int64(2), object(1)
memory usage: 133.1+ KB
#get PassengerId column for the submission csv before replacing test with the treated df
test_submission_labels = test['PassengerId']
#split data back into test and train:
test = total[total["ind"].eq("test")].copy()
train = total[total["ind"].eq("train")].copy()
#drop the indicator column (the .copy() above avoids SettingWithCopyWarning)
train.drop('ind', axis=1, inplace=True)
test.drop('ind', axis=1, inplace=True)
#First we define what our target is: 'Survived'
X = train.drop(columns='Survived')
y = train['Survived']
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
test_size=0.25)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(668, 11)
(223, 11)
(668,)
(223,)
#Hyperparameter tuning, choosing the best regularization strength for the model
param_grid = {'C': [0.005, 0.01, 0.05, 0.1, 0.5, 1]}
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid.fit(X_train, y_train)
#GridSearchCV with no scoring argument uses the classifier's default score, i.e. accuracy
print("Best cross-validation score (accuracy): {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
Best cross-validation score (accuracy): 0.80
Best parameters: {'C': 1}
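Beyond the single best score, `cv_results_` holds the mean score for every candidate, which helps to see how sensitive the model is to `C` (a sketch; `make_classification` stands in for the Titanic features here):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# small synthetic classification problem in place of the Titanic features
X, y = make_classification(n_samples=200, random_state=42)

grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    {'C': [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# one row per candidate C with its mean cross-validated accuracy
results = pd.DataFrame(grid.cv_results_)[['param_C', 'mean_test_score']]
print(results)
```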
#seeing the feature importance when using the best parameters for LR
best_LR = LogisticRegression(solver='liblinear', C=grid.best_params_['C']).fit(X_train, y_train)
importance = best_LR.coef_[0]
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
fig, ax = plt.subplots(figsize=(15, 5))
plt.bar(range(len(importance)), importance)
plt.xticks(range(len(importance)), X_train.columns)
plt.show()
Feature: 0, Score: -0.49627
Feature: 1, Score: -0.02482
Feature: 2, Score: 0.00895
Feature: 3, Score: -0.02964
Feature: 4, Score: 0.34101
Feature: 5, Score: 0.12766
Feature: 6, Score: 0.08107
Feature: 7, Score: -0.29350
Feature: 8, Score: 0.06582
Feature: 9, Score: -0.00761
Feature: 10, Score: -0.14298
param_grid = {
    'max_depth': [40, 80, 120],
    'n_estimators': [40, 80, 120, 160]
}
rf_clf = RandomForestClassifier()
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid,
                           cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best cross-validation score (accuracy): {:.2f}".format(grid_search.best_score_))
print("Best params:\n{}\n".format(grid_search.best_params_))
Best cross-validation score (accuracy): 0.81
Best params:
{'max_depth': 40, 'n_estimators': 80}
#seeing the feature importance when using the best parameters for RF
best_rf = RandomForestClassifier(**grid_search.best_params_).fit(X_train, y_train)
importance = best_rf.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
fig, ax = plt.subplots(figsize=(15, 5))
plt.bar(range(len(importance)), importance)
plt.xticks(range(len(importance)), X_train.columns)
plt.show()
Feature: 0, Score: 0.12644
Feature: 1, Score: 0.24461
Feature: 2, Score: 0.28329
Feature: 3, Score: 0.02449
Feature: 4, Score: 0.18885
Feature: 5, Score: 0.02596
Feature: 6, Score: 0.01548
Feature: 7, Score: 0.05451
Feature: 8, Score: 0.01236
Feature: 9, Score: 0.00923
Feature: 10, Score: 0.01478
param_grid = {'C': [0.1, 1, 10, 100]}
clf_svm = SVC(gamma='auto')
grid = GridSearchCV(estimator=clf_svm, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score (accuracy): {:.2f}".format(grid.best_score_))
print("Best params:\n{}\n".format(grid.best_params_))
Best cross-validation score (accuracy): 0.71
Best params:
{'C': 1}
#seeing the feature importance when using the best parameters for SVM
best_SVC = SVC(gamma='auto', C=grid.best_params_['C']).fit(X_train, y_train)
perm_importance = permutation_importance(best_SVC, X_test, y_test)
feature_names = X_train.columns
features = np.array(feature_names)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(features[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv= 5, scoring = 'f1')
knn_grid.fit(X_train, y_train)
knn_model = knn_grid.best_estimator_
print("Best cross-validation score (f1): {:.2f}".format(knn_grid.best_score_))
print("Best params:\n{}\n".format(knn_grid.best_params_))
Best cross-validation score (f1): 0.61
Best params:
{'n_neighbors': 3}
#seeing the feature importance when using the best parameters for KNN
best_knn = knn_grid.best_estimator_  #already refit on the training split with the best n_neighbors
# summarize feature importance
perm_importance = permutation_importance(best_knn, X_test, y_test)
feature_names = X_train.columns
features = np.array(feature_names)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(features[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
param_grid = {'n_estimators': range(20,110,10)}
GB_grid = GridSearchCV(GradientBoostingClassifier(), param_grid, cv= 5, scoring = 'f1')
GB_grid.fit(X_train, y_train)
GB_model = GB_grid.best_estimator_
print("Best cross-validation score (f1): {:.2f}".format(GB_grid.best_score_))
print("Best params:\n{}\n".format(GB_grid.best_params_))
Best cross-validation score (f1): 0.76
Best params:
{'n_estimators': 50}
#seeing the feature importance when using the best parameters for GB
# the grid search already refits the best model, so we can use it directly
model = GB_grid.best_estimator_
# get importance
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
fig, ax = plt.subplots(figsize=(15, 5))
plt.bar(range(len(importance)), importance)
plt.xticks(range(len(importance)), X_train.columns)
plt.show()
Feature: 0, Score: 0.02267
Feature: 1, Score: 0.07531
Feature: 2, Score: 0.17720
Feature: 3, Score: 0.00033
Feature: 4, Score: 0.55273
Feature: 5, Score: 0.00964
Feature: 6, Score: 0.00218
Feature: 7, Score: 0.14072
Feature: 8, Score: 0.00613
Feature: 9, Score: 0.00465
Feature: 10, Score: 0.00845
As seen from the feature importances we plotted for each of the 5 models, Fare, Title, Age, and Gender are usually the most important, while the embarkation port seems irrelevant in most cases. Nevertheless, we proceed without changing any more features for our final model.
#base models with the hyperparameters found by the grid searches above
estimators = [('Logistic_optimized', LogisticRegression(solver='liblinear', C=1)),
              ('RForest_optimized', RandomForestClassifier(max_depth=40, n_estimators=80)),
              ('SVM_optimized', SVC(gamma='auto', C=1)),
              ('KNN_optimized', KNeighborsClassifier(n_neighbors=3)),
              ('GB_optimized', GradientBoostingClassifier(n_estimators=50))]
stacking = StackingClassifier(estimators)
stacking.fit(X_train, y_train)
#reuse the estimator list and add the stack itself for comparison
list_of_estimators = estimators + [('stacking', stacking)]
for label, model in list_of_estimators:
cv_scores = cross_validate(model, X_train, y_train, cv=5, scoring=('f1','roc_auc'))
print(f"F1-score: {cv_scores['test_f1'].mean():0.4f} (+/- {cv_scores['test_f1'].std():0.4f}) | ROC-AUC: {cv_scores['test_roc_auc'].mean():0.4f} (+/- {cv_scores['test_roc_auc'].std():0.4f}) [{label}]")
F1-score: 0.6337 (+/- 0.0588) | ROC-AUC: 0.8119 (+/- 0.0606) [Logistic_optimized]
F1-score: 0.7319 (+/- 0.0377) | ROC-AUC: 0.8596 (+/- 0.0295) [RForest_optimized]
F1-score: 0.5719 (+/- 0.0353) | ROC-AUC: 0.7543 (+/- 0.0325) [SVM_optimized]
F1-score: 0.5768 (+/- 0.0586) | ROC-AUC: 0.7371 (+/- 0.0356) [KNN_optimized]
F1-score: 0.7582 (+/- 0.0473) | ROC-AUC: 0.8615 (+/- 0.0332) [GB_optimized]
F1-score: 0.7445 (+/- 0.0429) | ROC-AUC: 0.8633 (+/- 0.0326) [stacking]
By stacking, we combine the predictions of the models above. Although the stack scores only on par with the best individual model, we use the stacking classifier to predict the survival of the passengers in the testing data, since combining models also reduces the risk of overfitting to any single one.
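For reference, `StackingClassifier` also exposes a `final_estimator` (a logistic regression by default) and a `passthrough` flag that feeds the original features to the meta-learner alongside the base models' predictions; both were left at their defaults above. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# small synthetic problem in place of the Titanic features
X, y = make_classification(n_samples=200, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
                ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(),  # meta-learner on the base predictions
    passthrough=True,                      # also pass the raw features through
    cv=5,                                  # internal CV used to build meta-features
)
stack.fit(X, y)
print(stack.score(X, y))
```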
#preparing the submission features (renamed to avoid shadowing the earlier X_test split)
X_submission = test.drop(columns='Survived')
X_submission.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Sex         418 non-null    int64
 1   Age         418 non-null    float64
 2   Fare        418 non-null    float64
 3   Company     418 non-null    float64
 4   Title       418 non-null    int64
 5   Pclass_1    418 non-null    float64
 6   Pclass_2    418 non-null    float64
 7   Pclass_3    418 non-null    float64
 8   Embarked_C  418 non-null    float64
 9   Embarked_Q  418 non-null    float64
 10  Embarked_S  418 non-null    float64
dtypes: float64(9), int64(2)
memory usage: 39.2 KB
stacking.fit(X_train, y_train)
#we need the results as integers for the submission, so cast the predictions
y_pred = stacking.predict(X_submission).astype(int)
y_pred
array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1])
submission = pd.DataFrame({
"PassengerId": test_submission_labels,
"Survived": y_pred.astype(int)
})
submission.to_csv('../submission.csv', index=False)